
Conversation

@zkh2016 (Contributor) commented Aug 17, 2021

PR types

New features

PR changes

OPs

Describe

Fuses elementwise_add, dropout, and elementwise_add into one operator:

// before fusion
out1 = elementwise_add(src, bias)
out2 = dropout(out1)
out3 = elementwise_add(residual, out2)
// after fusion
out = fused_residual_dropout_bias(src, residual, bias)
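As a rough illustration of the per-element computation the fused op performs, here is a simplified sketch with hypothetical names (not the merged kernel): for brevity it reads pre-generated uniform randoms instead of calling curand, and it assumes upscale-in-train dropout.

__global__ void FusedResidualDropoutBiasSketch(
    const float *src, const float *residual, const float *bias,
    const float *rand, float *out, uint8_t *mask,
    const float dropout_prob, const int rows, const int cols) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= rows * cols) return;
  float val = src[idx] + bias[idx % cols];       // out1 = elementwise_add(src, bias)
  uint8_t keep = rand[idx] >= dropout_prob;      // dropout mask
  float dropped =
      keep ? val / (1.0f - dropout_prob) : 0.0f; // out2 = dropout(out1)
  mask[idx] = keep;
  out[idx] = residual[idx] + dropped;            // out3 = elementwise_add(residual, out2)
}

The fusion replaces three kernel launches (and two round trips of the intermediate tensors through global memory) with a single pass over the data.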

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@CLAassistant commented Aug 17, 2021

CLA assistant check: all committers have signed the CLA.

@xingfeng01 (Contributor)

LGTM

xingfeng01 previously approved these changes Aug 24, 2021
xingfeng01 previously approved these changes Aug 25, 2021

@xingfeng01 (Contributor)

LGTM

xingfeng01 previously approved these changes Aug 26, 2021

@xingfeng01 (Contributor)

LGTM

@Xreki (Contributor) left a comment

Consider whether the implementation can be further encapsulated to increase code reuse.

* @brief the fused function called by every thread
*/
template <typename T, typename MaskType, typename U, int VecSize,
bool layer_norm>
Contributor: layer_norm -> ComputeLayerNorm

Author: done

const platform::CUDADeviceContext &ctx) {
// dropout_prob == 1.0f
if (std::abs(dropout_prob - 1.0f) < 1e-5) {
PADDLE_ENFORCE_CUDA_SUCCESS(
Contributor: Use memory::Copy. Also, could dst and residual be the same address?
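For context, a hedged sketch of the memory::Copy form being suggested; `place` (a platform::CUDAPlace) is an assumed variable, not from the diff:

// equivalent of the cudaMemcpyAsync call via Paddle's wrapper
memory::Copy(place, dst, place, residual, rows * cols * sizeof(T), ctx.stream());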

Author: done

PADDLE_ENFORCE_CUDA_SUCCESS(
cudaMemcpyAsync(dst, residual, rows * cols * sizeof(T),
cudaMemcpyDeviceToDevice, ctx.stream()));
PADDLE_ENFORCE_CUDA_SUCCESS(cudaMemsetAsync(
Contributor: Use math::SetConstant.
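For reference, a hedged usage sketch of the suggested functor; the tensor name is hypothetical:

// zero-fill via Paddle's functor instead of raw cudaMemsetAsync
math::SetConstant<platform::CUDADeviceContext, MaskType> set_zero;
set_zero(ctx, &mask_tensor, static_cast<MaskType>(0));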

const int VecSize = 4;
if (dbias != nullptr) {
int real_vec_size = VecSize;
if (cols % VecSize != 0) real_vec_size = 1;
Contributor: Don't put this on one line.

Author: done


template <typename T>
struct TestFusedResidualDropoutBias {
uint32_t _rows;
Contributor: Name class member variables with a trailing underscore (xxx_); for a struct the underscore suffix can be omitted.

Author: done

#include <curand_kernel.h>

#include <iostream>
#include <memory>
Contributor: These two headers aren't used, are they?

Author: done

@@ -0,0 +1,70 @@
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Contributor: The file name fused_dropout.h is not appropriate.

Author: Renamed it to fused_dropout_common.h.

const platform::CUDADeviceContext &ctx, const uint64_t n) {
const uint64_t tmp_n = n / VecSize;
int threads = std::max(
(uint64_t)32, std::min(tmp_n, (uint64_t)ctx.GetMaxThreadsPerBlock()));
Contributor: Use static_cast for type conversions.
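A minimal sketch of the suggested change, keeping the variable names from the snippet above:

int threads = static_cast<int>(
    std::max(static_cast<uint64_t>(32),
             std::min(tmp_n, static_cast<uint64_t>(ctx.GetMaxThreadsPerBlock()))));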

/**
* get 1D threads and blocks
*/
template <int VecSize = 4>
Contributor: VecSize can be passed as a function argument here. Also, consider reusing the existing data structures and interfaces:

struct GpuLaunchConfig {
dim3 theory_thread_count = dim3(1, 1, 1);
dim3 thread_per_block = dim3(1, 1, 1);
dim3 block_per_grid = dim3(1, 1, 1);
int compute_capability = 0;
};
inline GpuLaunchConfig GetGpuLaunchConfig1D(
const platform::CUDADeviceContext& context, int64_t element_count,
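For illustration, a hedged usage sketch of the quoted interface; SomeKernel and its arguments are placeholders, and the vectorized element count is passed as element_count:

platform::GpuLaunchConfig config =
    platform::GetGpuLaunchConfig1D(ctx, static_cast<int64_t>(n / VecSize));
SomeKernel<T, VecSize><<<config.block_per_grid, config.thread_per_block, 0,
                         ctx.stream()>>>(/* kernel arguments */);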

Author: done

}

// aligned vector generates vectorized load/store on CUDA
template <typename T, int VecSize>
Contributor: No need to define this again; you can reuse the existing implementation directly:

template <typename T, int Size>
struct alignas(sizeof(T) * Size) CudaAlignedVector {
T val[Size];
};
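For illustration, a hedged sketch of the vectorized load/store pattern this struct enables (variable names are hypothetical; the same reinterpret_cast idiom appears in the diff):

using LoadT = CudaAlignedVector<T, VecSize>;
// one aligned vector load instead of VecSize scalar loads
LoadT src_vec = *reinterpret_cast<const LoadT *>(&src[i]);
LoadT out_vec;
#pragma unroll
for (int j = 0; j < VecSize; j++) {
  out_vec.val[j] = src_vec.val[j];  // per-element work goes here
}
// one aligned vector store
*reinterpret_cast<LoadT *>(&dst[i]) = out_vec;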

Author: done

*
*/

/********Forward**************/
Contributor: The comments at L27 - L32 add nothing; I suggest deleting them.

Author: done


namespace platform = paddle::platform;
namespace cg = cooperative_groups;
namespace memory = paddle::memory;
Contributor: These namespace aliases are unnecessary; the code can already use platform::float16 and memory::Copy directly.

Author: done

if (bias != nullptr)
*bias_value = *reinterpret_cast<const LoadT *>(&bias[col_id]);

float4 rand = curand_uniform4(state);
Contributor: If VecSize is 1, is there a problem here, or poor performance?

Contributor: Regardless of what VecSize is, four random numbers are generated here; that doesn't seem right either.

Author (@zkh2016, Sep 2, 2021): Fixed.

}
}

/********Backward**************/
Contributor: Don't use comments like this.

Author: done

* 2. save 128*8 temporary sum in 8*128 shared memory
* 3. reduce the sum of 128 rows data by 8*VecSize warps
*/
template <typename T, typename MaskType, int BLOCK_SIZE_X, int BLOCK_SIZE_Y,
Contributor: BLOCK_SIZE_X -> BlockSizeX

Author: done

}
}

// save temporary sum to cache and do transpose
Contributor: Is this for a block-level reduce? Can you call this existing implementation instead:

template <typename T, typename ReduceOp>
__device__ __forceinline__ T BlockXReduce(T val, ReduceOp reducer) {

Author: It's a bit different: here I do an 8-way reduce inside one block. The block is 8*128 threads; each group of 128 threads reduces one column of data, producing 8 sums.
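A minimal sketch of the layout the author describes, with hypothetical kernel and variable names and assuming float data: a dim3(8, 128) block in which threadIdx.x selects one of 8 columns and the 128 threads along y reduce that column through shared memory.

__global__ void ColumnSum8Way(const float *in, float *out, int rows, int cols) {
  // 8 columns per block, 128 partial sums per column; +1 pads against bank conflicts
  __shared__ float cache[128][8 + 1];
  int col = blockIdx.x * 8 + threadIdx.x;
  float sum = 0.0f;
  if (col < cols) {
    for (int row = threadIdx.y; row < rows; row += 128) {
      sum += in[row * cols + col];  // each thread strides over its column
    }
  }
  cache[threadIdx.y][threadIdx.x] = sum;
  __syncthreads();
  // tree-reduce the 128 partials of each of the 8 columns
  for (int stride = 64; stride > 0; stride >>= 1) {
    if (threadIdx.y < stride) {
      cache[threadIdx.y][threadIdx.x] += cache[threadIdx.y + stride][threadIdx.x];
    }
    __syncthreads();
  }
  if (threadIdx.y == 0 && col < cols) {
    out[col] = cache[0][threadIdx.x];  // 8 sums per block, one per column
  }
}
// launched with dim3 grid((cols + 7) / 8), block(8, 128)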

};

TEST(FusedDropout, GPUFusedResidualDropoutBias) {
const int rows = 16;
Contributor: The different test configurations could be written into a single TEST via a for loop.

Author: done

@Xreki requested a review from zhangting2020 September 1, 2021 08:51
@Xreki (Contributor) commented Sep 1, 2021

@zhangting2020 please help review this as well.

is_test = false;
hasbias = true;
platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
auto devicectx = pool.Get(place);
Contributor: Naming suggestions: hasbias -> has_bias, devicectx -> device_ctx.

Author: done

rows, cols, dropout_prob, is_upscale_in_train, src, residual, bias,
dst);
}
}
Contributor: In the calls above, the if (cols % VecSize != 0) check seems to affect only how VecSize is set. Is there a better way to avoid this duplicated code?

Author: VecSize is a template parameter; I haven't found another way.
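A hedged sketch of one common way to contain that duplication (the launcher names are hypothetical): keep VecSize as a template parameter, but route the runtime check through a single dispatch point.

template <typename T, int VecSize>
void LaunchResidualDropoutBias(const int rows, const int cols /*, ... */);

template <typename T>
void DispatchResidualDropoutBias(const int rows, const int cols /*, ... */) {
  // the cols % VecSize check lives in exactly one place
  if (cols % 4 == 0) {
    LaunchResidualDropoutBias<T, 4>(rows, cols /*, ... */);
  } else {
    LaunchResidualDropoutBias<T, 1>(rows, cols /*, ... */);
  }
}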


/**
* @brief call paddle dropout op
*/
Contributor: Is this test for the original dropout op? I see there is also a test for fuse_dropout_op, so I don't quite understand the purpose of this unit test.

Author: This header file is shared by several dropout-related unit tests; it calls dropout_op to provide the baseline version for comparison.

@Xreki (Contributor) left a comment

LGTM. Some code-optimization suggestions; they can be addressed in a follow-up PR.

*/
inline platform::GpuLaunchConfig Get1DBlocksAnd2DGrids(
const platform::CUDADeviceContext &ctx, const uint32_t rows,
const uint32_t cols, const int VecSize) {
Contributor: Name variables in xxx_xxx style: VecSize -> vec_size.

}

__forceinline__ __device__ void RandVec(curandStatePhilox4_32_10_t *state,
float *data, const int VecSize) {
Contributor: Same as above.

return config;
}

__forceinline__ __device__ void Rand1(curandStatePhilox4_32_10_t *state,
Contributor: I think writing this as a template function with specializations would be better.
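A hedged sketch of the template-with-specializations shape being suggested (not necessarily the merged code); curand_uniform and curand_uniform4 are the standard cuRAND device calls:

template <int VecSize>
__forceinline__ __device__ void RandVec(curandStatePhilox4_32_10_t *state,
                                        float *data);

template <>
__forceinline__ __device__ void RandVec<1>(curandStatePhilox4_32_10_t *state,
                                           float *data) {
  data[0] = curand_uniform(state);  // one uniform float
}

template <>
__forceinline__ __device__ void RandVec<4>(curandStatePhilox4_32_10_t *state,
                                           float *data) {
  float4 rand = curand_uniform4(state);  // four uniform floats in one call
  data[0] = rand.x;
  data[1] = rand.y;
  data[2] = rand.z;
  data[3] = rand.w;
}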

T factor = static_cast<T>(1.0f / (1.0f - dropout_prob));
if (!is_upscale_in_train) {
factor = static_cast<T>(1.0f);
}
Contributor: L109 - L112 can be written on one line with the ?: operator.
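For illustration, the one-line form being suggested, using the variables from the snippet above:

T factor = is_upscale_in_train ? static_cast<T>(1.0f / (1.0f - dropout_prob))
                               : static_cast<T>(1.0f);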

factor = static_cast<T>(1.0f - dropout_prob);
if (is_upscale_in_train) {
factor = static_cast<T>(1.0f);
}
Contributor: L114 - L117 can likewise be written on one line with the ?: operator.

test.has_bias = has_bias[j];
test.Run();
test.CheckOut(default_diff);
if (!is_fp16) {
Contributor: Doesn't fp16 check gradients?

T default_diff = static_cast<T>(1e-5);
if (is_fp16) {
default_diff = static_cast<T>(1e-2);
}
Contributor: L258 - 260 can also be simplified to one line.

default_diff = static_cast<T>(1e-2);
}
for (int i = 0; i < cols_list.size(); i++) {
for (int j = 0; j < 2; j++) {
Contributor: This can be written as follows:

for (auto col : {16, 17}) {
  for (auto has_bias : {true, false}) {
    ...
  }
}

TEST(FusedDropout, GPUFusedResidualDropoutBias3) {
const int rows = 16;
const int cols = 16;
TestFusedResidualDropoutBias<float> test(rows, cols, 0, 1.0, true, false);
Contributor: GPUFusedResidualDropoutBias2 and GPUFusedResidualDropoutBias3 differ only in one parameter setting; they can be merged into one.

}

// test large shape
TEST(FusedDropout, GPUFusedResidualDropoutBias6) {
Contributor: Make the unit test names more concrete, indicating what is actually being tested.

@Xreki merged commit cf8bf03 into PaddlePaddle:develop Sep 9, 2021
AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this pull request Sep 29, 2021